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Wo  describe  ;i  preconditioned  Krvlov  iterative  algorithm  based  on  domain  decomposition  for  im¬ 
plicit  linear  systems  arising  from  partial  differential  equation  problems  which  require  local  mesh 
refinement.  In  order  to  keep  data  structures  as  simple  as  possible  for  parallel  computing  applica¬ 
tions.  the  fundamental  computational  unit  in  the  algorithm  is  a  subregion  of  the  domain  spanned  by 
a  locally  uniform  tensor-product  grid,  suggestively  called  a  tile.  This  is  in  contrast  to  local  refine¬ 
ment  techniques  whose  fundamental  computational  unit  is  a  grid  at  a  given  level  of  refinement.  The 
bookkeeping  requirements  of  such  algorithms  are  potentially  substantial,  since  consistency  of  data 
must  be  enforced  at  points  of  space  which  may  belong  several  different  grids,  and  furthermore,  the 
grids  are  not  necessarily  of  tensor-product  type,  but  more  generally,  unions  thereof.  The  tile- based 
domain  decomposition  approach  condenses  the  number  of  levels  in  consideration  at  each  poini  of 
the  domain  to  two:  a  global  coarse  grid  defined  by  tile  vertices  only  and  a  local  fine  grid,  where  the 
degree  of  resolution  of  the  fine  grid  can  vary  from  tile  to  tile.  Experimentally,  ;f  is  she"  ’'  ''-'rein 
that  one  global  level  and  one  local  !°'-ol  provide  sufficient,  flexibility  to  handle  a  diverse  collection  of 
two-dimensional  problems  which  include  irregular  regions,  non  simply-connected  regions,  non-solf- 
adjoint  operators,  mixed  boundary  conditions,  non-smooth  coefficients,  or  non-smooth  solutions. 
We  employ  from  1  to  1024  tiles  ori  problems  containing  up  to  16K  degrees  of  freedom.  Though  mo¬ 
tivated  by  local  refinement  and  parallel  processing  applications,  benchmark  serial  implementations 
of  the  tile-based  algorithm  on  uniform  grids  produce  iteration  counts  and  execution  times  which 
are  competitive  with  those  of  traditional  global  preconditionings. 
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1.  Introduction 


i  in'  combination  of  domain  decomposition  with  preconditioned  iterative  methods  provides  a 
framework  which  extends  the  usefulness  of  numerical  techniques  for  certain  special  partial  differ¬ 
entia]  equation  problems  to  those  of  more  general  structure.  Non-smooth  features,  noil-separable 
geometries,  or  massive  sizes  of  practical  problems  limit  the  application  of  many  "standard"  nu¬ 
merical  techniques.  Direct  methods  are  rapidly  defeated  by  problem  size.  "Fast"  methods  which 
take  advantage  of  special  coefficient  and  grid  structure  often  do  not  apply  globally.  Iterative 
methods  often  depend  for  efficient  implementation  on  regular  grids  which,  if  global  in  extent,  are 
inconsistent  with  accurate  and  economical  resolution  of  the  physics  of  the  problem.  However, 
the  domains  of  piobletns  with  these  features  can  often  be  decomposed  into  smaller  subdomains 
of  simpler  structure,  increasing  the  utility  of  extant  software  libraries,  particularly  as  components 
o!  preconditioners.  Moreover,  the  domain  decomposition  can  be  made  to  produce  a  transparent 
mapping  of  many  problems  onto  medium-scale  parallel  computers.  Our  primary  focus  in  this  paper 
is  the  incorporation  of  spatially- varying  mesh  refinement  requirements  into  a  finite-difference- based 
domain  decomposition  algorithm.  We  illustrate  the  convergence  behavior  of  the  algorithm  on  a  va¬ 
riety  of  two-dimensional  elliptic  PDE  proc  mis.  including  non-self-adjoint,  non-separable  geometry 
rases.  We  also  point  out  features  of  the  method  which  are  relevant  to  a  parallel  implementation 
but  defer  the  corresponding  complexity  analysis  to  a  subsequent  companion  paper. 

Many  PDE  problems  which  are  "large”  in  the  discrete  sense  are  so  because  the  continuous 
problems  from  which  they  are  generated  require  resolution  of  several  different  length  scales  for  the 
production  of  a  meaningful  solution.  The  value  of  compromising  between  the  extremes  of  globally 
uniform  refinement,  which  leads  to  simple  and  usually  vectorizable  algorithms  but  wastes  time  and 
memory,  and  pomiwise  adaptive  refinement,  which  minimizes  the  discrete  problem  size  for  a  given 
accuracy  requirement  but  leads  to  complicated  data  structures,  has  been  recognized  for  some  time 
and  described  in  contexts  too  numerous  to  acknowledge  fairly.  Locally  Uniform  Mesh  Refinement 
(LI  MR)  characterizes  one  such  class  of  discretizations,  based  on  composites  of  highly  structured 
subgrids.  Many  treatments  of  LUMR  in  the  literature  pertain  to  explicit  methods  for  transient 
problems,  a  class  with  its  own  advantages  (see  [3]  and  references  therein)  and  limitations  [30] 
which  is  somewhat,  distinct  from  ours.  Implicit  t roat men ts  of  locally  regular  refinement  for  elliptic 
problems  include  approaches  arising  out  of  classical  multigrid  (see  [31]  and  references  therein),  a 
nonconforming  spectral  technique  [30],  and  methods  rooted  in  iterative  substructuring  for  finite 
element  problems  [">]. 

Computationally  pract  ical  locally  uniform  grids  are  usually  expressible  as  t  he  union  of  a  coarse 
uniform  tensor-product  grid  covering  the  entire  domain  with  one  or  more  refined  tensor-product 
grids  defined  over  subregions,  including  the  possibility  of  multiple,  nested  levels.  Generalizations 
of  this  within  the  I.f  MR  framework  include  allowing  the  gruls  at  any  particular  level  of  refinement 
to  themselves  be  the  union  of  tensor-product  subgrids,  and  reinterpreting  "uniform"  as  "quasi 
uniform  to  aiiow  general  curvilinear  coordinates  for  custom  body-  or  solution-fitting.  We  select 
for  consideration  a  rather  restricted  form  of  LUMR  in  which  refinement  occurs  exclusively  within 
complete  cells  of  a  quasi-uniform  coarse  grid,  as  described  in  section  2  below. 

i  lie  goal  of  the  present  contribution  is  an  LUMR  methodology  with  starkly  simple  data  struc¬ 
tures.  for  efficient  portability  to  a  variety  of  parallel  machines.  It  borrows  from  the  mesh  refinement 
arid  domain  decomposition  literature  and  from  the  authors'  own  experience  in  these  areas  and  in 
parallel  computation  [20.  22.  2*j.  In  our  pursuit  of  convenience  and  overall  parallel  performance. 

* n  v-' h : ~ ^  include  both  •! o t speedup  and  efficiency,  we  are  ready,  potentially,  to  compromise 
"optimality  as  'b-fined  by  conventional  serial  computing  measures,  for  example,  by  refining  only 
it;  units  of  full  coarse  grid  cells,  we  mav  impose  a  tendency  towards  refinement  in  regions  where  it 
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Figure  1:  The  anatomy  of  a  tile.  Unless  closed  by  a  physical 
boundary,  a  tile  is  open  along  its  high-z  and  high- 1/  perimeter. 

would  be  unnecessary  from  a  truncation  error  point  of  view  alone.  As  another  example,  our  con¬ 
vergence  rate  is  dependent  upon  a  coarse  grid  resolution  which  may  be  chosen  with  criteria  beyond 
convergence  rate  in  view,  such  as  the  balance  of  work  among  multiple  processors.  Fortunately,  the 
methodology  survives  such  compromises  and  is  even  sequentially  advantageous  in  many  problems. 

The  domain  decomposition  algorithms  we  employ  (section  3)  involve  “nearly"’  parallel  precon¬ 
ditioners  in  conjunction  with  generalized  minimum  residual  (GMRKS)  iteration,  a  non-stationary 
method  not  dependent  upon  operator  symmetry.  In  two  dimensions,  the  preconditioner  involves 
three  separate  phases:  a  global  coarse  grid  solve,  independent  solves  along  interfaces  between  sub- 
domains,  and  independent  solves  in  the  subdomain  interiors.  The  global  coarse  grid  solve,  which 
we  do  directly,  is  an  essential  feature  as  it  provides  the  only  global  exchange  of  information  in  the 
preconditioner  itself.  We  will  compare  alternative  formulations  of  the  more  negotiable  interface 
and  subdomain  solves. 

The  main  body  of  the  paper  is  the  collection  of  numerical  experiments  on  two-dimensional  ellip¬ 
tic  boundary  value  problems  in  section  4.  The  experiments  include  standard  model  problems.  "L"- 
shaped,  “T"’-shaped,  and  non-simply-connected  regions,  non-self-adjoint  operators,  mixed  bound¬ 
ary  conditions,  and  problems  with  non-smooth  coefficients  or  non-smooth  solutions.  We  use  from  1 
to  1024  coarse  grid  elements  on  problems  containing  up  to  16K  degrees  of  freedom.  Among  our  find¬ 
ings  is  that  the  interface  probe  preconditioning  advocated  in  our  earlier  work  on  convective-diffusive 
systems  with  stripwise  decompositions  [28]  does  not  perform  as  well  on  decompositions  with  inter¬ 
nal  vertices  as  the  much  simpler  tangential  operator  preconditioning.  We  also  demonstrate  that 
incomplete  factorizations  are  not  c  .  -effective  subdomain  interior  preconditioners,  relative  to  exact 
subdomain  solves,  once  the  subdort,  oecome  sufficiently  narrow. 

2.  Mesh  refinement  by  tiles 

In  this  section  we  describe  a  simple  mesh  refinement  philosophy  based  on  a  regular  tessellation 
of  the  global  domain  into  subdomains  which  we  call  “tiles"’  in  two  dimensions.  Mathematically, 
a  tile  is  the  tensor-product  of  h^if-open  intervals  in  each  coordinate  direction,  except  that  a  tile 
abutting  a  physical  boundary  along  what  would  ordinarily  be  one  of  its  open  edges  is  closed  along 
that  edge.  Each  tile  possesses  its  own  tensor-product  discretized  interior,  at  least  two  of  its  four 
sides,  and  at  least  one  of  its  four  corners.  Although  the  specific  convention  is  arbitrary,  we  assume 
for  definiteness  that  in  its  own  local  right-handed  coordinate  system,  each  tile  contains  its  origin 
and  its  z  and  ij  axes  (see  Figure  1). 

In  cortrast.  to  physical  boundary  segments,  we  refer  to  the  artificial  decomposition-induced 
boundaries  of  the  tiles  as  “interfaces’".  We  refer  to  the  points  at  the  intersection  of  all  boundaries. 
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Figure  2:  Sample  Tessellations,  (a)  is  permissible,  (b)  is 
not. 

physical  or  artificial,  as  “cross-points'’.  We  require  that  the  cross-points  be  embeddable  in  a  tensor- 
product  global  quasi-uniform  coarse  grid,  from  which  only  points  lying  exterior  to  the  (possibly 
multiply  connect)  boundary  are  missing.  This  rules  out  irregular  tiling  patterns  such  as  in 
Figure  '2b.  However,  there  is  no  requirement  that  the  domain  itself  be  of  tensor-product  type:  the 
decomposition  in  Figure  2a  is  permissible. 

Associated  with  each  tile  is  the  data  defined  over  a  quasi-uniform  grid  covering  its  portion 
of  the  domain  and  a  set  of  operators  for  executing  its  block-row  portions  of  the  preconditioner 
solve  to  be  described  later.  In  our  object-oriented  approach,  these  operators  can  potentially  be  of 
different  types  for  different  tiles.  For  computational  convenience,  we  assume  throughout  that  the 
grids  covering  individual  tiles  are  derived  from  the  coarse  grid  of  cross-points  through  refinement  in 
ratios  of  powers  of  two.  We  can  therefore  indicate  refinement  levels  using  the  graphical  shorthand 
of  Figure  10  where  the  integer  indicates  the  logarithm  of  the  refinement  ratio. 

2.1.  Tile-tile  interfaces 

In  order  to  minimize  restrictions  on  the  structure  of  adjacent  tiles  (and  to  eliminate  redundant 
communication  between  tiles  in  a  multiprocessor  implementation,  in  which  different  tiles  might  be 
assigned  to  different  processors),  each  tile  stores  and  maintains,  in  addition  to  its  own  data,  the 
data  associated  with  a  buffer  region  of  phantom  points  equal  in  width  to  one-half  of  that  of  its 
associated  finite  difference  stencil  (see  Figure  3).  Excluding  the  redundant  phantom  points,  each 
point  of  the  domain  is  uniquely  associated  with  a  single  tile. 

Data  at  the  phantom  points  is  supplied  in  a  manner  dependent  upon  the  internal  structure  and 
refinement  ratios  of  the  adjacent  tiles  in  question.  A  finer  tile  obtains  bi-quadraticallv  interpolated 
data  from  its  coarser  neighbor.  Since  the  problems  studied  herein  involve  second-order  operators, 
this  allows  the  use  of  conventional  finite  difference  techniques  in  generating  the  difference  equations 
at  the  subdomain  interfaces.  Bi-linear  interpolation  alone  would  limit  the  potential  accuracy  of  a 
second-order  differencing  scheme,  as  observed  in  some  preliminary  experiments.  We  note  that  such 
a  difference  scheme  does  not  guarantee  discrete  flux  conservation.  Our  focus  herein  is  simply  on 
the  solution  of  a  consistent  set  of  discrete  equations.  More  careful  attention  to  the  discretization 
has  already  been  given  in  the  context  of  locally  regular  refinement  in  [19], 

All  of  our  examples  employ  strictly  uniform  local  grids.  Although  this  is  not  a  necessary 
restriction  of  the  method,  this  simplifies  the  exchange  of  data  between  adjacent  tiles. 

The  coarse  grid  system  obtains  its  data  by  simple  ( unweighted )  injection.  That  is.  the  value 
at  the  point  in  the  finer  neighboring  tile  that  lies  on  the  coarse  grid  stencil  is  used  for  the  coarse 
grid  point.  A  weighted  averaging  could  be  employed  to  preserve  operator  symmetry,  if  that  were 
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Figure  3:  Sample  tile,  showing  the  computational  buffer 
region  required  for  the  standard  five-point  stencil. 

necessary  for  other  reasons,  for  instance,  conjugate  gradient  iteration  of  a  sen-adjoint  problem. 
A  finite  element  discretization  with  transition  elements  along  the  interface  would  unainujb  ■■  'wv 
deliver  the  appropriate  weighting  coefficients  in  this  case. 

The  selection  of  refinement  criteria  is  a  much  studied,  yet  still  open  problem;  see  [2]  and 
[26]  for  a  sampling  of  work  in  this  area.  The  refinement  criteria,  however,  are  orthogonal  to  the 
equation-solving  aspect  considered  herein,  except  to  the  extent  that  a  part  of  the  computational 
work  required  by  one  o?  these  tasks  may  be  a  by-product  of  the  other.  Some  issues  in  refinement 
criteria  will  be  discussed  in  a  subsequent  report  [23] .  For  present  purposes,  v"?  give  one  example 
with  a  smooth  solution  but  non-smooth  coefficients  and  others  with  smooth  coefficients  but  a 
non-smooth  solution.  In  these  examples.  "good'1  refinement  strategies  can  be  done  “by  hand  . 

In  general,  tile  interfaces  can  be  the  site  of  changes  in  the  discretization  besides  just  the 
refinement  level.  For  instance,  the  discrete  stencil  can  change  order  at  interfaces.  Even  the  form 
of  the  operators  or  their  number  can  change  at  interfaces  while  still  preserving  the  subdomain 
uniformity  required  for  efficient  subdomain  solution  algorithms.  As  a  motivational  example,  a 
reacting  flow  problem  frequently  consists  of  large  regions  in  which  there  is  only  transport  of  mass, 
momentum,  and  thermal  energy  but  no  reaction  among  constituents  of  known  composition,  to  all 
adequate  orders  of  approximation.  fii  other  regions  it  is  essential  to  retain  composition  variables, 
because  they  diffuse  differentially,  and  in  a  subset  of  these,  reaction  terms  must  also  be  retained 
in  the  equations.  To  accommodate  such  generality,  the  routines  that  pack  the  buffer  regions  are 
responsible  for  providing  the  necessary  mappings. 

2.2.  Physical  boundaries 

For  generality,  the  equations  for  the  physical  boundaries  are  incorporated  into  the  overall 
system  matrix,  including  Dirichlet  conditions.  Our  implementation  allows  inhomogeneous  Robin 
boundary  conditions  at  all  boundary  points,  namely, 

+  *(-r,  t/)«  =  c(x.y). 
an 

Both  first-  and  second-order  one-sided  difference  approximations  to  the  normal  derivative  term  are 
employed.  The  second-order  approximation  is  used  in  the  actual  operator,  and  the  first-order  is 
used  in  the  preconditioners  (to  preserve  uniformity  of  the  bandwidth  of  the  matrices  used  in  the 
preconditioning).  Though  tempting  in  their  simplicity.  Dirichlet  boundary  conditions  alone  in  the 
preconditioner  were  found  to  perform  poorly  in  practice,  in  accord  with  expectation  from  the  theory 
in  [33]  and  references  therein. 
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Figure  4:  One-dimensional  schematic  of  the  tile  basis  func¬ 
tions. 

2.3.  Comparison  with  other  approaches 

In  contrast  to  multi-level  approaches  in  which  the  fundamental  computational  unit  is  a  grid 
at  a  given  level,  our  fundamental  computational  unit  is  a  subregion  of  the  domain.  The  present 
approach  requires  only  one  grid  which  possesses  connectivity  with  arbitrarily  distant  regions  of  the 
domain,  namely  the  coarsest  one.  In  the  framework  of  the  hierarchical  basis  function  technique 
[43],  we  have  simply  a  two-level  hierarchy,  but  the  the  higher  level  may  be  different  in  different 
subregions.  Figure  4  gives  a  one-dimensional  illustration.  This  admittedly  represents  a  severe 
condensation  of  the  range  of  intermediate  scales  present  in  multi-level  local  uniform  refinement, 
on  which  much  of  the  asymptotic  convergence  theory  is  based.  Tiles  are  much  closer  to  being  the 
software  equivalent  of  the  “geometry-defining  processors"  (GDPs)  of  Dewey  and  Patera  [T»] 

The  tile  approach  is  also  similar  to  the  additive  Schwartz  method  [16,  41]  and  the  techniques 
of  [6]  in  its  reliance  upon  just  a  single  domain-spanning  grid.  The  main  difference  between  these 
techniques  and  the  tile  approach  is  in  the  treatment  of  the  interfacial  degrees  of  freedom.  In  the 
additive  Schwarz  technique,  interior  problems  are  solved  on  extended  overlapped  subdomains,  of 
which  the  intcrfacial  degrees  of  freedom  are  interior  points  and  thus  demand  no  special  consider¬ 
ation.  In  [6],  good  preconditioners  for  the  interfacial  degrees  of  freedom  are  derived  theoretically, 
for  self-adjoint  operators.  Optimal  algebraic  convergence  (independent  of  degree  of  refinement) 
has  been  proved  for  both  classes  of  algorithms  in  [18]  and  [5],  so  there  are.  intuitively,  grounds  for 
optimism  about  single  global-grid  algorithms  even  though  we  present  no  extensions  of  the  theory 
to  the  non-self-adjoint  problems  we  consider.  The  main  disadvantage  in  condensing  out  interme¬ 
diate  scales  is  that  the  coarse  grid,  on  which  all  optimal  approaches  require  an  exact  solve,  cannot 
necessarily  become  as  coarse  as  one  might  like. 

The  field  of  locally  uniform  mesh  refinement  is  spanned  by  a  continuum  of  resolution  strategies 
governed  by  clustering  rules  which  control  the  size  and  shape  of  the  refined  subregions.  Global 
refinement  lies  at  one  extreme  and  pointwise  adaptive  refinement  at  the  other.  As  soon  as  the 
global  tensor  product  mesh  is  abandoned  a  host  of  difficult  practical  decisions  need  to  be  made 
about  data  structures  and  clustering  algorithms.  The  logic  required  to  handle  the  numerous  types 
of  siibgrid-subgrid  interactions  which  can  arise  and  t.o  insure  the  consistency  of  the  data  structure 
is  a  significant  impediment  to  efficient  parallelism.  In  contrast,  "horizontal”  neighbor-neighbor 
interactions  are  simple.  The  sufficiency  of  a  two-level  approach  in  obtaining  reasonable  convergence 
is  demonstrated  in  section  I.  Compelling  superiority  of  approaches  with  a  greater  richness  of  scales 
has  not  yet  beep  full v  established  in  production  parallel  software,  although  it  may  be  ultimately. 

■’> 


Experience  on  parallel  computers  gained  from  a  two-level  approach  will  be  beneficial  in  any  even' . 

3.  Iterative  domain  decomposition  algorithms 

As  mentioned  in  the  introduction,  preconditioned  iterative  methods  and  domain  decomposition 
provide  a  framework  suitable  for  the  description  of  a  wide  class  of  algorithms.  The  four  common 
elements  of  this  framework  are:  a  global  operator  arising  from  the  discretization  of  the  PDF.  nr 
system  of  PDEs);  an  approximate  inverse,  or  preconditioner,  for  the  global  operator:  an  iterative 
method  relying  only  on  repeated  application  of  the  preconditioned  operator:  and  a  geometry- 
based  partition  of  the  discrete  unknowns  so  that  size,  locality,  and  uniformity  can  be  exploited 
in  applying  the  preconditioned  operator.  Since  the  numerical  analysis  literature  contains  man;.' 
successful  discretization  schemes  and  iterative  methods  specialized  for  different  operator  properties, 
such  as  the  presence  or  absence  of  definiteness  and  symmetry,  the  recent  burgeoning  effort  in 
iterative  domain  decomposition  algorithms  has  concentrated  primarily  (though  not  exclusively  >  on 
the  interaction  of  the  second  and  fourth  of  these  elements.  In  the  parallel  context,  this  is  a  natural 
preoccupation  because  the  bottleneck  to  parallelism  usually  (though  not  exclusively)  lies  in  Tie 
requirement  of  the  global  transport  of  information  in  the  preconditioner. 

Many  of  the  numerical  examples  described  in  section  4  rule  out  the  use  of  iterative  meth¬ 
ods  based  on  symmetry,  but  permit  the  assumptions  ol  definiteness  and  diagonal-dominance.  In 
particular,  full  or  incomplete  factorizations  of  subdomain  matrices  can  be  undertaken  wimout 
pivoting.  Because  ol  its  robustness,  we  join  many  recent  users  [13.  32.  38.  42]  in  adopting  'he 
parameter-free  generalized  minimum  residual  (GMRES)  method  [37]  as  the  outer  iteration.  1  he 
main  disadvantages  of  GMRES.  rs  linear  and  quadratic  (in  iteration  index)  memory  and  execution 
time  requirements,  respectively,  must  be  mitigated  by  scaling  and  preconditioning.  For  other  ac¬ 
celeration  schemes,  such  as  Chebvshev.  the  memory  and  execution  time  requirements  may  be  only 
constant  and  linear,  respectively,  but  GMRES  dispenses  with  the  difficulty  of  estimating  param¬ 
eters.  The  primary  type  of  decomposition  used  herein  involves  roughly  unit  aspect  ratio  tiles,  as 
opposed  -o  thin  strips.  Ordering  the  interior  points  (and  the  physical  boundary  points  other  than 
cross-points)  first,  the  cross-points  last,  and  the  interfaces  connecting  the  cross-points  in  between, 
gives  a  nested-dissection-like  "arrow”  matrix  appearance  to  the  global  discrete  operator,  which  wo 
denote  .4.  The  basic  structure  of  our  preconditioner  B  is  the  block-upper  triangular  portion  of 
'he  arrow  matrix.  The  application  of  B~l  thus  begins  with  a  cross-point  solve,  which  updates  the 
right-hand  sides  of  a  s°t  of  independent  interface  solves.  These,  in  turn,  update  the  right-hand 
sides  of  a  set  of  interior  solves.  For  a  nine-point  stencil,  the  cross-point  result  would  also  update 
the  interior  right-hand  sides.  However,  there  is  no  dependence,  within  a  single  iteration,  of  Tie 
interface  solution  upon  the  result  of  the  interior  solution,  or  of  the  cross-point  solution  upon  ei¬ 
ther.  <  In  f  1 1  ] .  structurally  symmetric  arrow  matrix  preconditioners  were  compared  against  the 
corresponding  triangular  forms  on  a  variety  of  strip-wise  decomposed  problems.  It  is  found  therein 
that  retaining  the  interior-to-interface  coupling  in  the  preconditioner  generally  reduces  the  total 
number  of  iterations  required  to  attain  a.  fixed  convergence  criterion,  but  that  the  execution  time  <4 
the  structurally  symmetric  algorithm  is  greater,  because  of  the  cost  of  the  extra  set  ot  subdomain 
solves  in  ea!  h  iteration.  The  first  and  second  sets  of  subdomain  solves  are  inherently  sequential.* 

The  derivation  of  the  coefficients  of  the  preconditioner  blocks  is  as  follows.  The  cross-point 
equations  are  simply  a  scaled  coarse  grid  discretization  of  the  continuous  PDF..  Physical  bound. nr;, 
points  lying  at  tile  comers  are  retained  in  the  cross-point  system  in  order  to  accommodate  first .. >p;.t 
Neumann  or  mixed  conditions  in  this  coarse  strid  discretization  Weighted  averaging  possibility-  !*>r 
the  derivation  of  the  coarse  grid  operator  arise  from  t he  posse.-.sion  of  t he  coefficients  and  right  -  :ia :i>! 
side  on  titier  grids  -urrounding  each  cross-point,  but  those  are  not  currently  exploited.  1  he  current 
implementation  -upports  I. F- based  Gaussian  elimination  on  the  coarse-. grid  system.  This  -ope  ;- 
‘  lie  'he-!  parallel  bottleneck  in  the  precond  it  ioner  and  can  be  performed  in  either  ol  two  wav-: 


redundantly  on  each  processor  after  broadcasting  the  required  coefficient  data  for  small  -y-t. m:-. 
or  in  .1  fully  distributed  lashion  lor  large  systems.  Determination  of  the  most  efficient  technique  i- 
generally  domain  and  network  dependent.  If  strip  decompositions  are  used,  there  is  no  cross-puni; 
system,  and  the  lower-right  block  of  the  preconditioner  is  simply  the  interface  system  described 
below. 

I  he  rile  interior  equations  consist  of  fine  grid  discretizations  of  the  PDF,  over  local  region-, 
with  physically  appropriate  boundary  conditions  along  any  true  boundary  segments  and  Dirich- 
let  boundary  conditions  at  artificial  interfaces.  Only  first-order  differences  are  accommodat'  d  in 
the  physical  boundary  conditions  of  the  preconditioner,  even  if  higher-order  are  employed  in  the 
operator  .1.  The  current  implementation  supports  full  LU  Gaussian  elimination,  incomplete  Id 
decomposition,  or  modified  incomplete  Ltd  decomposition.  Each  tile  performs  its  interior  -oive 
completely  independently. 

f  alike  the  coarse  grid  and  tile  interior  equations,  which  bear  the  physical  dimension  of  die 
underlying  PDE  and  have  natural  preconditionings,  the  lower-dimensional  interfacial  equation?-  a r<> 
properly  derived  from  a  related  pseudo-differential  operator,  a  theoretically  well-developed  approach 
we  do  not  pursue  here  because  of  the  difficulty  in  applying  it  to  arbitrary  problems.  Instead.  w< 
have  compared  three  approaches  referred  to  below  as  (a)  tangential,  (b)  truncated,  and  ( c i  interface 
probe.  I  he  tangential  interface  preconditioner  is  the  one-dimensional  discretization  of  the  I'tiii- 
ol  the  underlying  operator  which  remain  when  the  derivatives  normal  to  the  interface  are  -et  m 
zero.  Fhe  truncated  interface  preconditioner  is  a  discretization  of  the  full  underlying  operator, 
with  the  coefficients  associated  with  non-interfacial  unknowns  set  to  zero.  The  interface  probe 
preronditioner  has  been  described  elsewhere  [0,  29]  as  a  low-bandwidth  approximation  to  the  true 
capacitance  matrix  of  the  interfacial  unknowns  in  the  ambient  matrix  corresponding  to  the  degree* 
of  freedom  of  the  interface  itself  and  the  two  subdomain  interiors  on  either  side. 

I  he  differences  between  these  three  techniques  are  perhaps  most  easily  visualized  by  consider¬ 
in'.!  the  example  of  I  aplace’s  equation  on  a  uniformly  discretized  square  partitioned  by  an  interface 
parallel  to  one  pair  of  edges  into  subdomains  I  and  2.  the  interfacial  unknowns  being  subscripted  :i. 
I.et  ,lu  and  .4 22  be  the  subdomain  operators,  let  .-D3  and  .-I23  translate  the  values  on  the  interface 
into  the  respective  subdomain  boundary  condition  right-hand  side  vectors,  and  vice  versa  for  .G, 
and  . \  >2 .  The  tangential  preconditioner  is  the  tridiagonal  matrix  with  diagonal  elements  -2  and 
sub-  and  super-diagonal  elements  1.  The  truncated  preconditioner  is  the  same  except  for  —  t’>  mi 
tb°  diagonal.  1  ue  interface  prone  pieconuilGno.  is  the  truncated  preconditioner  minus  a  diagonal 
matrix  whose  elements  are  those  of  the  vector  [.-lai-lj"/  .4 13  +  .4  32  .-ED1  -  f 2.3] ^ •  where  t  is  the  v«v'or 
ot  all  1  s.  The  probe  preconditioncr  has  the  same  row  sum  as  the  actual  Schur  complement  matrix 
for  the  interface,  namely  .4 ->3  —  . l3]  .l]",1  ,-ln  -  .4  32 .4  .4  2.3.  These  three  preconditioner  matrix  types 

differ  only  along  the  diagonal,  with  the  elements  of  the  probe  diagonal  lying  between  the  first  two. 

4.  Numerical  experiments 

I  he  numerical  experiments  of  this  section  serve  to  illustrate  the  effectiveness  of  the  domain 
decomposition  methods  employed  in  terms  of  the  convergence  of  the  iterations  and  also  the  etf'ec- 
livened  of  the  locally  uniform  mesh  refinement  in  terms  of  the  convergence  of  the  discretization. 

4.1.  Model  problems 

Ue  present  twelve  model  problem  ,  each  containing  a  single  dependent  variable  and  two  inde¬ 
pendent  vu  ria  I  lies.  1  he-o  rest  rict  ions  on  the  number  of  variables  beg  generalization .  so  we  comment 
bneflv  at  the  out-et.  Multiple  dependent  variable  cases  have  been  examined  for  stripwise  decom- 
po-i'ion-  in  d't!  and  will  be  presented  for  the  current  cross-point  of  the  algorithm  in  a  subsequent 
paper  orient.-  ,1  t .  1  ward  a  a  p  plica  t  ions.  The  extension  of  our  current  t  ech  niques  to  t  breed  intension  a  I 
problems  is  straightforward,  but  not  necessarily  effective.  Optimal  or  near  optimal  algorithms  for 
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Figure  5:  The  four  domains  considered  in  this  paper. 

three-dimensional  problems  are  known  which  require  a  more  implicit  lower-right  corner  block  oi  'lie 
preconditioner,  containing  more  than  cross-points  alone  [l‘>.  17],  We  have  not  yet  examined  them 
so-called  "wire- basket"  forms  of  the  preconditioner.  From  a  parallel  perspective,  they  introduce 
additional  sequential  overhead,  and  a  careful  consideration  of  the  trade-oris  between  convenience 
rate  and  cost  per  iterations  will  be  required  in  a  future  study. 

Some  of  the  problems  below  are  self-adjoint  and  could  be  discretized  in  a  symmetric  manner 
and  perhaps  solved  mor<-  cheaply  with  conjugate  gradients  than  with  G.MRLS.  Our  main  interest, 
however,  is  in  the  more  extensible  formulation.  In  all  the  examples  to  follow  except  for  the  last, 
an  exact  solution  of  the  continuous  problem  Cu  =  /  is  specified.  From  this  u.  all  of  the  following 
source  terms  /  and  boundary  condition  mhomogeueil ies  >7  may  b«  r.-dculafed.  In  cases  where  the 
expressions  for  /  and  <j  are  sufficiently  simple,  they  are  written  out  along  with  the  solution.  1  Im 
twelve  problems  include  four  different  domains,  pictured  in  Figure 

File  first  two  examples,  with  constant  coefficients  .and  an  exact  solution  quadratic  in  each 
independent  variable,  are  extremely  simple  and  possess  truncation-error-free  second-order  finite 
difference  representations.  They  are  identical  except  for  the  type  of  boundary  conditions  along  one 
side  of  their  square  domain.  These  problems  are  not  candidates  for  mesh  refinement:  rather,  'hey 
are  chosen  to  show  the  deterioration  in  convergence  rate  caused  when  Dirichlet  boundary  condition.-' 
are  replaced  with  Neumann,  and  to  allow  controlled  experimentation  on  the  effect  of  inaccurate 
boundarv  conditions  in  the  preconditioner.  The  poor  convergence  of  #'2  using  the  precondit inner 
of  #1  led  to  tlie  decision  to  expand  the  cross-point  system  to  include  physical  boundary  point-  m 
t  he  general  case. 
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hie  at  th*'  discrete  level  from  an  order-of-tnat'nit udo  physical  domain 
problem. 


Anisotropic  diffusion. 


0  /  du\  i)~u 

—  a  —  a. - =  2 1  i  t  -  1  i 

i)r  \  i)x  )  i)y- 

>i(  s.  y )  =  x~  r  y~ 

a  =  10 

Dirichlet  dat  a  on  <75? 

5?  —  lint  square 

u‘  fourth  example  is  a  prototype  convection-diffusion  problem:  a  passive*  scalar  in  ; 
is  fully  developed  at  the  outflow.  It  is  a  companion  problem  to  #2  in  the  sense  of 
>th  .solution  with  on*-  Neumann  boundary,  but.  asymmetry  due  to  the  convection, 
opv  comes  from  a  first-order  operator,  it  is  also  an  interestin'*  complement  to  #.'i 

m  :t  I:  Pine  flow  convection-diffusion  with  fully -devi>loped  outflow  boundary. 
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I'  he  next  t  wo  examples  (the  first  two  from  the  stamlar'l  "population"  of  elliptic  profile::. - 
■  io.  iiti; )  brills'  in  non-con.-i  ant  coefficients,  the  latter  in  a  non -self- adjoint  way  with  Robin  bon  non  ■ 
comiit  ions.f 

Problem  # Self-adjoint,  non-constant  coefficient .  Dirirhlet  boundaries. 
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Problem  Non-self-adjoint,  non-constant  coefficient,.  Robin  boundaries. 
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The  derivative  is  the  outward  normal. 

The  seventh  example,  from  [1,  27],  has  a  smooth  solution,  but  rapidly  varying  coefficients  alone 
an  internal  layer.  Here,  the  solution  itself  gives  no  hint  of  the  requirement  of  mesh  refinement. 
Interestingly,  the  locations  of  maximum  error  in  a  uniformly  refined  discretization  of  the  PUP  do 
not,  e\  .*ii  occur  at  the  internal  I  aver  itself,  but  towards  t  he  interiors  of  the  two  subdomains  it  <  1  i  v  i  <  i  •  - 

ju.; 

Problem  #7:  Internal  layer. 
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1  in-  n  •  ■  x  i  t  free  examples  are  obtained  by  taking  1  lire.*  different  value-  of  r  fie  couve,  > *■  , 
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Problem-  tt  s  ID:  <  “vlind  rbaliv -separable  reentrant  corner  conveci  mn  ditlu-ion  problem 

*  Ih**  ms  .r»*  wpjflv  rH'-r^n*  *•  •  niii.uiis  ;m  i']>nti<  il  iM  mm  <>f  #  '*  url  .1  Mini!  u  lot  u**r  i-i*  uH 

•  *f  it*.  \  t  \  *  |  *1 » i'  -il  mt-.t  mi  f  h**  I  ftt  t  r  r^u*J»-rs  if  ill- j  I 


in 


where  r  =  v/('r  -  1 )-’  +  ((/  -  l}2 
and  6  —  arg^-r  —  J )  f  t(y  —  1}),  0  <  9  <  2k 
Diriclilet  data  on  uQ 
n  =  I. -shaped  legion 

1  he  first  of  these  corresponds  to  [jure  diffusion,  and  the  second  and  third  to  convection  in  towards 
the  reentrant  corner,  ami  away  from  it  respectively,  at  a  rate  inversely  proportional  to  radius. 
The  respective  values  of  "he  radial  eigenfunction  exponent  o  are  -2,  1,  and  approximately  10.01 12. 

/2.  The  first  two  solutions  of  this  trio  lack 
derivatives  at  the  reentrant  corner.  The  last  is  everywhere  twice-diffe-rentiable.  but  the  solution 
is  characterized  by  steep  variation  in  the  three  non-reentrant  corner  regions,  where  r  >  1.  Local 
mesh  refinement  is  critical  to  impiwing  the  accuracy  of  a  finite-difference  solution.  In  addition  to 
refinement,  a  simple  change  to  Dm  finite  difference  scheme  in  the  vicinity  of  the  reentrant  corner  is 
made  that  substantially  improves  the  accuracy  of  the  sol-  tion:  this  is  described  in  more  detail  in 

The  eleventh  example,  from  [-1,  27],  illustrates  now  an  irregularly-shaped  domain  may  force  a 
minimum  granularity  'pon  a  tessellation  comprised  ol  congruent  tiles.  For  the  problem  at  hand, 
the  minimum  granularity  is  near  the  ideal  one. 

Problem  #11:  T-shaped  domain. 
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The  last  example,  from  [7],  is  provided  to  illustrate  the  accommodation  ol  non-simply-conac"  ’■  d 
domains.  Again,  the  geometry  imposes  a  minimum  granularity  on  congruent  tiles. 

Problem  #12:  Two-hole  domain. 
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perspect  i  vp  -uj  i  face  plots  of  t  lie  sol  ut  iom  tot  Imse  t  we|  v<>  problem'-  are  given  in  Figures  band  i 
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Figure  6:  Surface  plots  of  fhe  test  problem  solutions:  (a) 
#1-3.  (b)  #4.  (c)  #5,  (cl)  #6.  (e)  #7.  (f)  #11.  (Note  that 
the  solution  to  #1 1  is  smooth:  the  apparent  fronts  are  due  to 
zeroing  the  surface  over  the  undefined  regions  of  the  T-shaped 
domain. ) 


4.2.  Parameters  studied 

Several  categories  of  experiments  are  reported.  First,  a  two-dimensional  parameter  space  con¬ 
sisting  of  coarse  grid  resolution  and  overall  (  uniform)  resolution  is  exp'ored  by  numerical  experiment 
on  problem.,  #  ]  If).  \  non-restarted  GMRKS  algorit  hm  is  vised,  block-triangularlv  prerondit ioniul 
with  exact  solves  on  the  subdomain  interiors  and  on  the  coarse  grid,  and  with  tangential  inter¬ 
face  solves.  Here,  as  throughout  this  study,  we  use  exclusively  right,  preconditioning  and  an  initial 
iterate  of  zero.  The  goal  of  these  experiments  is  t  he  evaluation  of  the  alg  trithm  over  a  range  of  res¬ 
olution,.  in  terms  of  iteration  count  and  execution  time,  for  comparison  with  hack-of- f he-en v. dope 
rornplexit v  analvses. 
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Figure  7:  Surface  plots  of  the  test,  problem  solutions:  (a) 

#*.  (b)  #9,  (c)  #10.  (d)  #12. 

Another  set  of  experiments  is  performed  on  problems  #7-10  with  the  goal  of  evaluating  the 
economy  of  the  local  refinement  technique.  We  show  that,  local  uniform  mesh  refinement  is  capable 
of  significant  f'Pf  and  memory  savings  with  tto  sacrifice  of  accuracy  relative  to  uniform  refinement, 
but  that  improving  'he  discretization  in  simple  ways  can  be  more  effective  than  considerable  refine¬ 
ment.  In  a  third  set.  we  evaluate  the  effect  of  decomposition  orientation  for  non-unit-aspect  ratio 
tiles,  using  problems  #1  I.  The  limiting  cases  are  the  stripwise  decompositions  previously  consid¬ 
ered  by  us  in  ir29'.  In  another,  brief  proof-of-concept  section,  we  present  results  for  the  complex 
domain  problems.  #1  1  and  #12.  We  then  evaluate  different,  preconditioner  options  than  the  exact 
interior  solves  and  tangential  interface  solves  used  in  all  of  the  examples  above.  With  exact  interior 
-ol  vis.  we  com  pa  re  t  liree  di  fferent  interfaced  precondit  toners,  and  for  t  an  gent  ial  interface  solves,  we 
compare  three  different  interior  preconditioners,  l  inallv.  we  compare  our  preferred  options  in  this 
et  to  global  incomplete  factorizations  for  all  oi  the  problems  which  are  posed  on  square  domains. 
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- 

- 

- 

128 

1 

i 

1 

1 

1 

1 

1 

i 

1 

1 

1 

Table  1:  Iteration  count  as  a  function  of  number  of  tiles 
per  side  of  circumscribing  square,  t.  and  number  of  mesh 
points  along  a  tile  side,  m,  at  constant  refinement  parameter. 
h~l  =  128.  for  a  reduction  in  the  initial  residual  of  10-5.  The 
last  two  lines  of  the  table  are  not  available  experimentally  due 
to  minimum  discrete  subdomain  size  conventions  in  the  code; 
however,  the  last  line  consists  of  all  l's  bv  definition,  when 
t  =  h~K 

The  timings  given  below  are  from  a  Multiflow  Trace  14/200  computer  using  64-bit  reals.  All  of 
the  code  (primarily  in  C  but  with  FORTRAN  computational  kernals)  was  compiled  with  the  default 
(-03)  optimization  and  with  version  2.1.3  of  the  compilers.  Because  of  the  varying  performance  of 
hardware  (vector,  parallel,  superscalar)  on  different  problem  sizes  (due  to  different  startup  costs 
and  data  dependency  limitations),  execution  times  are  difficult  to  compare  directly.  The  reader 
should  keep  in  mind  while  studying  the  results  that  different  organizations  of  the  code  and  different 
compiler  capabilities  can  account  for  large  variations  in  times  across  architectures  and  software 
releases.  We  have  run  the  same  experiments  (to  the  extent  supported  by  memory)  on  two  other 
Unix  machines  and  find  that  the  proportion  of  time  spent  in  factorization  and  solution  phases 
varies  widely  between  machines  even  though  the  relative  rankings  of  total  timings  remain  mostly 
the  same.  In  addition,  replacing  the  nonsymmetric  bandsolver  in  UNPACK  used  to  solve  the  linear 
systems  with  a  custom  nonpivoting  routine  produces  a  large  benefit  on  one  computer  (a  factor  of 
three  reduction  in  time),  but  has  little  effect  on  another. 

4.3.  Convergence  as  a  function  of  coarse  grid  granularity 

In  order  to  test  coarse  grid  granularity  over  a  large  range,  we  fix  the  finest  mesh  spacing  at 
h-1  =  128  (relative  to  the  length  of  the  domain,  whether  that  be  1  in  the  first  seven  problems,  or 
2  in  the  next  three)  and  investigate  the  tradeoff  between  numbers  of  tiles  and  points  per  tile,  as 
shown  in  Tables  1  and  2  and  plotted  in  Figure  8.  The  mesh  is  identical  and  uniform  for  all  runs 
in  these  tables  (with  the  obvious  exception  that,  one  quadrant  of  it  is  not  present  in  the  L-shaped 
domain  problems.  #8-10,  which  therefore  lack  single-tile  entries).  The  convergence  criterion  is  a 
relative  reduction  in  residual  of  five  orders  of  magnitude.  Table  1  shows  that  the  iteration  count 
peaks  in  the  middle  of  the  granularity  range,  at  either  4  or  8  tiles  per  side.  The  bottom  row  of 
all  l's  can  be  supplied  without  benefit  of  actual  experiments,  since  it  represents  a  direct  solve  on 
a  single  grid.  The  top  row  entries  differ  from  1  in  problems  where  the  precondit ioner  has  different 
(lower  order)  boundary  conditions  than  the  operator  .4. 

Table  2  shows  the  rJeceptiveness  of  iteration  count  alone  as  a  measure  of  overall  performance. 
In  execution  time,  the  extreme  runs,  represent ing  single-domain  limiting  cases,  suffer  due  to  the 
high  cost  per  iteration,  even  though  the  number  of  iterations  required  is  very  small.  1  his  table  is  a 
profound  illustration  of  the  title  of  [  10];  Domain  Decomposition  Beneficial  Keen  Se  qucntinllii.  I  lie 
most  favorable  total  sequent  ini  execution  lim°s  are  found  for  multi-domain  cases  near  the  iteration 
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416 
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113 
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121 

119 
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Si 

i 

4  1 

42 

60 

48 

34  | 

41 

48 

46 

45 

25 

25 

23 

8 

16 

12 

1!) 

20 

25 

34 

27 

25 

14 

14 

12 

16  | 

s 

1 

13 

23 

23 

42 

27 

23 

9 

14 

12 

:  52 

4 

17 

32 

41 

55 

70 

47 

35 

36 

21 

17 

Table  2:  Execution  time  (sec)  as  a  function  of  number  of 
tiles  per  unit  length.  /.  and  number  of  mesh  points  along  a 
tile  side.  m.  at  constant  h~l  =  12S.  for  a  reduction  in  the 
initial  residual  of  10"°. 


Total  Execution  Time  (sec.) 


Figure  8:  Plots  of  Tables  1  and  2  (problems  #1-10  super¬ 
posed).  illustrating  that  the  minimum  execution  time  serial 
algorithm  occurs  near  t  =  16  tiles  on  a  side,  despite  the  large 
iteration  count  at  this  granularity. 


count  maxima,  in  particular  at  16  tiles  per  side. 

The  factorization  of  the  banded  matrix  in  the  single  subdomain  case  is  the  dominant  contri¬ 
bution  to  the  overall  time.  In  problems  #1-7,  over  410  seconds  are  spent  doing  the  factorization 
alone.  Of  course,  one  might  not  ordinarily  employ  exact  solves  on  the  single  domain  cases,  although 
many  structural  analysis  codes  do  this  very  thing.  A  comparable  penalty  will  accrue  in  an  attempt 
to  do  exact  solves  on  a  very  fine  “coarse"  grid,  in  which  each  tile  contains  just  one  point.  However, 
the  table  of  execution  times  is  truncated  beyond  tile  sizes  of  m  =  4. 

The  behavior  in  Table  2  can  be  understood  with  reference  to  back-of-the-envelope  complexity 
estimates  for  the  solution  and  factorization  operators  of  the  preconditioner.  We  observe  that  there 
are  Oil *)  cross-point,  interfaces,  and  interiors.  Naturally  ordered  banded  direct  factorizations  and 
solves  require  0(  X  b2)  and  0(  X b)  operators  respectively,  where  A  is  the  number  of  unknowns  and  b 
the  bandwidth.  For  the  cross-point  system.  X  1 2  and  b  ~  /;  for  the  interfaces.  A  =  m  and  b  =  1 : 
and  for  the  subdomain  interiors,  A  =  m2  and  h  =  m.  Thus,  the  interface  operation  counts  are 
always  asymptotically  subdominant  and  can  be  omitted  in  the  following.  From  choosing  the  larger  oi 
the  cross-point  and  interior  complexities,  we  see  that  factorization  costs  nia.X((,m){<A(  / '),  0(  t'm  1 )} 
and  solves  cost  rnaxj,  m|{0(/'!).0((2Hi!)}.  Since  m  =  128//  in  these  experiments,  the  first  term 
grows  with  /  and  the  second  decays  with  it.  Quick  calculations  reveal  that  (to  the  resolution 
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24 

31 

21 

12 

11 
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s 
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12 

20 

22 
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29 

28 

23 

17 

14 

16 

16 

128 

i£ 

17 

22 

27 

31 

25 

23 

16 

_ 

13 

10 

Table  3:  Iteration  count  as  a  function  of  number  of  tiles  per 
side  of  circumscribing  square,  /,  and  refinement  parameter, 
h~l .  at  const  ant  number  of  mesh  points  along  a  tile  side. 
m  =  8,  for  a  reduction  in  the  initial  residual  of  10~3. 

of  the  table)  the  minima  for  both  factorization  and  solve  costs  occur  at  or  between  t  =  Id  and 
32  when  h~l  —  128.  1'he  tendency  of  buffer  overhead,  neglected  in  the  estimates,  is  to  favor  a 
slightly  smaller  number  of  tiles  t  than  thus  estimated.  It  is  important  to  note  that  the  memory 
requirements  follow  the  solve  complexities  above.  Thus,  for  a  fixed  memory  size,  an  intermediate 
coarse  grid  granularity  accommodates  the  largest  problem  in  core.  Of  course,  all  of  these  per 
iteration  complexity  estimates  need  to  be  redone  when  the  preconditioner  blocks  are  other  than 
exact  solves,  for  instance,  incomplete  factorizations.  However,  incomplete  and  exact  factorizations 
differ  little  in  actual  cost  per  iteration  when  the  grid  is  narrow  enough  in  the  rapidly  ordered 
direction,  which  includes  the  case  of  small,  square  tiles. 

4.4.  Convergence  as  a  function  of  tile  refinement 

In  contrast  to  the  previous  section,  we  here  investigate  iteration  count  as  a  function  of  overall 
resolution,  for  a  fixed  number  of  subintervals  per  tile.  The  results  are  shown  in  Table  3.  The  global 
mesh  grows  in  refinement  from  16  to  128  as  the  number  of  points  per  tile  remains  constant  at  8.  In 
spite  of  the  fact  that  the  truncation  error  improves  with  at  least  h~l.  we  use  the  same  convergence 
tolerance  of  10~°  as  in  the  earlier  tables.  The  fine  grid  in  the  last  line  of  Table  3  corresponds  to 
the  t  =  16  case  of  the  earlier  tables. 

The  experiments  suggest  that  the  iteration  count  is  bounded  nearly  independently  of  h.  and 
thus  that  the  two-level  algorithm  is  nearly  optimal  asymptotically  in  the  constant,  m  limit.  In 
fact,  some  of  the  finest  mesh  results  are  even  relatively  better  than  preceding  coarser  ones.  This 
should  not  be  regarded  as  surprising,  since  there  is  a  steep  price  for  this  favorable  iteration  count 
when  m  is  hold  constant  and  h~l  is  increased,  namely,  a  larger  cross-point  system.  We  have  not 
pursued  any  theoretical  justification  for  this  bound,  but  the  theory  for  conjugate  gradient  iteration 
for  self-adjoint  problems,  see,  e.<y,  [6.  40],  contains  similar  results,  namely,  constant  upper  bounds 
on  the  iteration  count  for  constant  m. 

As  representative  convergence  histories,  we  present  Figure  9  which  follows  the  residual  reduc¬ 
tion  over  five  orders  of  magnitude,  and  the  time  versus  iteration  count  history  for  problems  #1 
and  #2.  The  latter  plots  reveal  the  quadratic  term  in  the  GMRES  work  estimate  that  comes 
from  the  need  to  orthogonalize  each  iterate  over  a  subspace  whose  size  grows  linearly  in  iteration 
count.  This  pair  of  figures  also  illustrates  the  poorer  conditioning  of  Neumann  problems,  since  the 
initial  iterates  and  t.ho  solutions  converged  to  are  identical,  and  so  are  the  operators  except  for  one 
Neumann  boundary  segment. 

4.5.  Economies  of  local  mesh  refinement 

Examples  #7  through  #10  allow  us  to  display  the  well-known  benefits  of  local  uniform  mesh 
refinement,  in  elliptic  problems:  comparable  accuracy  in  considerably  fewer  operations,  compared 
with  global  uniform  refinement.  We  solve  these  problems  at  refinement  levels  of  /)”'  =  32.  04.  12S. 
and  256.  based  on  the  global  grid,  but  perform  both  global  and  local  refinements  for  comparison, 
where  possible.  (The  finest,  global  refinement,  does  not  fit  into  the  memory  available,  which  is.  of 
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Figure  9:  Convergence  histories  for  problems  #1  and  #2,  for 
t  =  16,  m  =  8,  h~l  =  128.  (a)  and  (b)  show  the  normalized 
Euclidean  norm  of  the  residual  versus  iteration  count,  and 
(c)  and  (d)  show  time  versus  iteration  count. 

course,  another  of  the  main  motivations  for  LUMR.  along  with  execution  time  savings.)  All  of 
these  computations  were  made  with  a  reduction  in  the  residual  of  10-8.  so  that  the  measure  of  the 
error  would  not  be  contaminated  by  the  residual.  In  all  cases,  the  choice  of  where  to  refine  is  made 
by  hand.  In  a  forthcoming  paper  [23]  we  will  show  that  the  local  error  is  not  always  adequate  as 
an  indicator  of  the  optimal  refinement  location,  bince  we  are  interested  in  stuumg  how  domain 
decomposition  and  mesh  refinement  interact,  given  a  good  refinement  strategy,  we  eliminate  the 
latter  question  from  this  study. 

tables  4  through  7  compare  global  refinement  results  on  the  left,  and  local  on  the  right.  Each 
set  of  columns  lists  the  number  of  unknowns,  the  sup-norm  of  the  error,  the  number  of  iterations  to 
reduce  the  discrete  residual  by  8  orders  of  magnitude,  and  the  total  execution  time  thus  required. 

I  he  right-most  column  gives  the  execution  time  ratios  for  each  refinement  level.  Memory  use  ratios 
can  also  be  estimated  from  the  tile  structure  of  the  discrete  problem,  but  the  present  code  records  no 
explicit  allocation  measurements.  All  entries  share  a  constant  value  of  /  =  8  in  order  to  fix  regions 
ot  enhanced  refinement  that  do  not  shrink  as  li  does.  Therefore,  the  "global"  iteration  columns 
of  I  aides  1  through  7  comprise  a  convergence  study  which  is  complementary  to  both  Table  1  (in 
which  h  is  constant)  and  Table  3  (in  which  m  is  constant). 
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Figure  10:  Refinement  levels.  The  maximum  (third  level) 
local  uniform  refinements,  (a)  Problem  #7.  (b)  Problems 
#8  and  #9.  (c)  Problem  #10.  In  second  level  tests,  all  tiles 
showing  "3"  are  set  to  '"2”.  In  first  level  tests,  these  are 
further  reduced  to  “1".  In  zeroth  level  refinement,  all  tiles 
are  set  to  "0".  which  here  corresponds  to  m  =  8. 
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Table  4:  Number  of  unknowns  -V,  sup-norm  of  the  error 
c,  iteration  count  /.  and  execution  time  i  (sec)  for  problem 
#7  (internal  layer),  globally  and  locally  refined,  along  with 
execution  time  ratios,  for  a  reduction  in  the  initial  residual 
of  io~8. 

Ihe  behavior  of  iteration  count  with  each  doubling  of  global  refinement  in  the  se'f-  ad  joint 
problems  in  Tables  1  and  3  is  consistent  with  the  logarithmic  growth  in  conditioning  with  i>~[ 
proved  for  self-adjoint  problems  in  [(>].  The  locally  refined  exam  [ties  also  worsen  in  conditionin'.: 
with  h~x  when  /  is  held  constant,  but  the  CPU  time  advantage  of  local  in with 

h  ~  1 .  overall. 
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Table  5:  Number  of  unknowns  .V,  sup-norm  of  the  error 
c.  iteration  count  /.  and  execution  time  T  (sec)  for  problem 
#8  (reentrant,  cornei.  pure  diffusion),  globally  and  locally 
refined,  along  with  execution  time  ratios,  for  a  reduction  in 
the  initial  residual  of  ID-8. 
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Table  6:  Number  of  unknowns  .V,  sup-norm  of  the  error 
<: .  iteration  count  [.  and  execution  time  T  (sec)  for  problem 
#0  (reentrant  corner,  convective  inflow),  globally  and  locally 
refined,  along  with  execution  time  ratios,  for  a  reduction  in 
the  initial  residual  of  10“8. 
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Table  7:  Number  of  unknowns  .V,  sup-norm  of  the  error  e. 
iteration  count  /,  and  execution  time  T  (sec)  for  problem  #10 
(reentrant  corner,  convective  outflow),  globally  and  locally 
refined,  along  with  execution  time  ratios,  for  a  reduction  in 
the  initial  residual  of  10_s.  The  error  values  here  appear 
large,  but  are  in  fact  small  relative  to  the  size  of  the  solution. 

The  sup- norm  of  the  error  shows  sublincar  improvement  in  h  in  problems  #8  and  #9.  as  one 
expects  with  non-different iable  solutions.  The  second-order  accuracy  ot  the  discretization  is  readily 
apparent  (tin-  ratio  of  errors  is  almost  exactly  4  with  each  reduction  of  h  by  2)  in  problem  #i.  and 
the  first-order  accurate  treatment,  of  convection  in  problem  #10  leaves  its  signature  as  well. 

In  fable  x  we  show  the  benefit  of  rediscretization  of  the  tiles  surrounding  the  reentrant  corner 
in  problems  #8  and  #9  to  fit  the  discrete  solution  to  the  known  power-law  radial  dependence  oi 
the  singular  exact  solution  (see  the  problem  statements  above).  Rather  than  making  the  customary 
Taylor  series  assumptions,  we  take  u(r)  =  t/o  +  urp  4- hr“p.  where  p  is  derivable  from  a  local  analysis 
( see  (21  j).  Figure  11  displays  u(r)  along  the  ray  9  =  -y .  which  is  the  symmetry  axis  of  the  three 
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Figure  11:  Cross-section  of  u(r)  along  the  symmetry  axis. 

(a)  Problem  #8.  pure  diffusion,  non-differentiable  at  r  =  0. 

(b)  Problem  #9.  ccnvective  inflow,  strengthening  the  singu¬ 
larity,  ( <; )  Problem  #10.  convective  outflow,  eliminating  the 
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Table  8:  Sup-norm  of  the  error  e.  iteration  count  /,  and 
execution  time  T  (sec)  for  problems  #8  and  #9  locally  refined 
with  asymptotic  fitting,  along  with  the  ratio  of  the  error  to 
the  corresponding  local  entries  without  asymptotic  fitting  in 
Tables  5  and  6. 


L- shaped  problems. 

4.6.  Numerical  compromises  associated  with  domain  geometry 

I  he  domains  of  problems  #11  and  #12  provide  an  interesting  test  of  the  tile  decompositions 
advocated  herein  because  they  can  be  more  simply  described  with  less  restrictive  decompositions, 
l  or  instance,  if  the  only  restriction  on  the  decomposition  was  that  all  subdomains  had  to  be  rectan¬ 
gular.  the  first  has  a  two-subdomain,  and  the  second  a  five-subdomain  decomposition.  In  contrast, 
our  uniform-size  decompositions  require  a  minimum  of  48  and  23  tiles  respectively.  However,  be¬ 
cause  the  Neumann  boundary  conditions  of  #12  require  a  minimum  stencil  width  for  the  coarse 
grid  solve  in  the  preconditioner,  we  must  further  bisect  (in  each  coordinate  direction)  obtaining  a 
92-tile  decomposition.  Convergence  results  for  some  constant  h  discretizations  are  given  in  Tables  9 
and  10.  , 

Though  domain  geometry  prohibits  much  exploration  of  granularity  parameter  space,  we  note 
that:  fa)  the  practical  granularities  are  in  the  range  found  most,  useful  for  problems  #1  iu  in 
1  aides  1  and  2:  (it)  the  number  of  processors  available  in  a  typical  medium-scale  parallel  computer  / 

Isay  2*  through  28j  is  appropriate  for  tessellnfing  shapes  such  as  these,  which,  when  allowed  to 
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Table  9:  Execution  time  (sec)  as  a  function  of  number  of 
tiles  per  unit  length,  /.  and  number  of  mesh  points  along  a 
tile  side.  rn.  at  constant  /t_1  =  128.  for  a  reduction  in  the 
initial  residual  of  10~;>  on  problem  #11  (T-shaped  domain). 
1'here  are  12516  unknowns. 
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Table  10:  Execution  time  (sec)  as  a  function  of  number  of 
tiles  on  a  side.  t.  and  number  of  mesh  points  along  a  tile 
side.  m.  at  constant  h~x  —  160  subintervals  on  a  side,  for 
a  reduction  in  the  initial  residual  of  10-s  on  problem  #12. 
(Note  that  the  two-hole  domain  of  this  example  is  in  '0.  10]  x 
[0.10]).  There  are  2  1001  unknowns. 
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Table  11:  Iteration  count  for  different  preconditioner  block 
combinations  at,  constant  refinement  parameter.  /t"1  =  128. 
and  tessellation,  t  -  8.  rn  =  16.  for  a  reduction  in  the  ini¬ 
tial  residual  of  10“N  GMRES  was  restarted  after  every  100 
iterations.  —  indicates  that  the  iteration  had  not  converged 
after  500  steps. 


undergo  quasi-uniform  distortion,  are  sufficiently  general  for  a  large  class  of  typical  two-dimensional 
engineering  applications:  and  (r)  the  quasi-uniform  tiles  represent  quasi-uniform  quanta  of  work 
for  a  convenience  in  load-balancing  that  the  less  restrictive  minimum  tessellations  do  not  have. 

It  should  be  noted  that  tlie.se  problems  are  alternativelv  solved  very  successfully  by  embedding 
into  the  circumscribing  squares,  and  using  preconditioners  based  on  fast  solves  on  the  squares,  in 
what  is  known  as  t he  capacitance  mat rix  method  (see.  e.g..  [51]).  Complex  domains  are  often  better 
candidates  for  embedding  preconditioned  than  for  decomposition  preconditioned  in  terms  nf  the 
size  of  the  capacitance  system  (see.  (.<}..  fs]).  We  note  that  either  approach  can  lead  to  efleetive 
paralle|i-m.  since  readily  parallelized  fast  solvers  exist  [12]. 


4.7.  Tests  of  algorithmic  combinations 

lable-  11  and  12  explore  different  algorithmic  combinations  for  the  precoiid  it  inner  block'. 
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four  '’!!>•  i  *  > ! :  i ;  i  i  i  >  pre< -  *m  i  i » iouers  exact .  Ii  l  >  ( ) » .  il .  1 " «  1  > .  and  Mil. I’M). i  ,t;ni  t  i  n*  •*rt’;n  ■ 

. i i'll t  i<  >ner-  I  I’m)  i.  t  ru  neat  ed .  a  n>l  tangential  i.  as  <*.«■><•  ri  1  > < > <  1  in  sect  ion  2.  are  t  «*>i  e<l  mi  n  : 

--ellatimi  Im-  problems  #1  10.  Many  combination-  <li<l  not  converge  aft<T  100  it«»r;i * :<*:i -  •  > 

-!nrt i  </  (1MH1.S  117:  wa-  employed.  in  which  an  intermediate  solution  was  computed.  and  a  :  •  •••• 
Krylov  -nb-pace  begun  every  100  iterations.  I  p  to  live  re-tart  cycles  were  attempted.  Wot.  • 
proine*:.,  «  j  7  contain  10011  unknowns,  and  10  contain  12' 10.  so  a  union  of  <ubspac«-  •  m 
-i-t  i  nu  ,i  maxi  ii  mm  of  7,00  search  direct  ions  repre-ents  only  2  to  1  percent  of  the  dimension  a. ,  ■ ;. 
of  the  pi  oii.e  mi  ;  even  so.  it  is  hey  on  d  the  range  of  a!  t  ract  i  ve  performance  for  such  met  hod.- . 

1  vide  at  1  v.  t  he  interface  probe  technique  1  I'M)  i  does  not  work  as  u  ell  as  tlm  t  an gent  i  a !  ]>ree. 
tion.er  on  these  problems,  though  it  is  always  better  than  t  lie  truncation  preconditioner.  A  pn--, 
explanation  tor  the  poor  performance  of  the  I  I'M) :  int  erface  ha  ltd  I  i  t;  u;  is  that  probing  near  tin-  • 
point  s  i-  an  inaccurate-  rharacteri/af  ion  of  the  mutual  in  Hue  nee  of  points  on  i  nt  ersect  in  a  i  nt  erbn  •  - 
I  hough  1 ! * '  0 )  is  a  no* id  technique  for  adaptin';  interface  precondit  ionin';  to  coetficier.t  varni'  :•  . 

the  t allies  ill ust  rat e  that  the  st rail’ll! forward  version  for  stripwise  decomposition-  is  inellect  A ,  , ., 
•To-s-ix >int  problem  with  "short"  interior  interfaces.  Suitable  generalization-  of  t  he  interface 
technique  are  important  to  applications  because  tlm  information  required  to  construct  11’ -0, 
beibled  'iirectly  into  the  matrix  elements  of  the  linear  system  to  be  solved,  whereas  construction  ■  •: 
t  lie  t  a  ii tp'iit  ial  precondit  ioner  requires  j  n format  ion  about  t  he  original  di  ir<'r«>nt  ial  « iperator.  a  no  '  :  • 
relevant  collection  of  terms  is  not  defined  for  source-sink  operators. 

We  note  that  the  non-exact,  subdomain  solves  perform  very  poorly  for  the.-*’  problem-,  reiativ. 
to  the  exact  solves.  I  he  exception  i-  M 1 1. 1  (01.  which  performs  well  mi  the  Dirirhlet  probl*  m-. 

I  mure  12  is  a  Useful  diagnostic  lor  tli*’  poor  performance  of  maiiv  ol  the  cnmhinai  ions.  Sh"W  ■  m 
•  he  -ix  pa  n*’ls  is  a  su  rface  plot  of  t  he  ehunent  s  of  t  he  vector  /}  “ 1  f  (nr  problem  at  1 .  decom  [)o,-e.i  mo¬ 
an  s  <  s  array  of  Hi  <  1  f>  tiles.  A  good  precondit  inner  II  will  yield  a  plot  resembling  the  act  uni  -oiu  t  a  c 
'.f  the  pro  idem .  ii  =  j’pf  ; figure  0(a  ) ).  This  is  reasonably  well  approximated  by  com bi  na!  n  m  - 
ot  tangential  interface  preconditioning  and  exact  or  Mil. I  subdomain  precondit  inning.  I  Im  other 
combination-  with  rangeiiti.il  interface  preconditioning  -how  that  th*’  "w are- lia.-ki’t  part  .q  ■  1 ; .  ■ 
-"b  1 1  ion  i  -  well  -defined .  but.  that  t  he  subdomain  precondit  inning  i-  poor.  1 1.  Id  1  i  i  -  -light  ly  -  u  p.-t  ■  >; 
to  ||. I  Mil.  expeetr- <1.  and  as  more  hands  of  fill-in  are  permitted.  II  I  (  k  i  eventually  con  \  •  -ge- 
to  an  exait  m >1  ve  i  when  k  =  in  —  1  lor  the  five-point  operatort.  !  lie  combinations  with  1 1  ’ ■ 11 1 
and  t  ru  nca  t  e<|  interlace  precnrid  it  ion  ing-  -how  that  the  <jiia  lit  v  ol  the  prei  mnl  it  .on  i  tig  i-  1"-'  a  t  1  lie 
"Wire  !>a.-k‘‘l  Mage.  i  II  d  e  pell  d  e|i  1  ol  t  lie  -II  lx  lolll  a  I II  -olv<’-. 
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esidll.il 
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.  (  iMIi  1  S  wa 

-  ta  st  art ed  a i ' 
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everv  100  iter; 

t  i«  n  i  > . 

i '  ‘ii  t!ii’  narrower  bamlwidt  lis  i.\  the  st  r  i  ;>  ^  1 1 1  >  <  1  <  >  m ;  1  i  1 1  band'olvers.  and  an  • 
>'i .' 1  i  he  ■>"'  pronounced  )iad  we  Used  last  II  I  -implement  abb1.  -;>ar-e.  or  i : m  <•  j ; i ;».»■: «  •  :C 
-"I'.ers.  1  .ii, ill),  we  note  that  st  ri  pwi.se  dec,  >m  pi  oil  ii  mi  -  ca  a  e\ p  i  •  .0  ,t  ni'''i  i  npie.,  [;■  :  > :  ■  • !  • 

anee.  ki'epijii*  the  subdomains  unbroken  alone  > .irons’!  v  ■  otiph’d  dir*‘<  t  n  Me  i’l  1 

r  iiireit  ion  :  lead'  to  better  iteration  counts  and  execution  times  than  otherwise.  li-r.'.  ■••. «  ; 
fiomr  de<  ompositinris  are  a  nood  compromise  let  ween  the  "eom!  and  the  "!>a«;  r:  p  >  >:  •<  '■  ‘ 
In  problems  in  which  the  optimal  si  rip  orientation  ej*  j^r  varies  fr<  >m  location  !•■•  a‘  • 

•tn known  a  priori,  a  cross-point  decomposition  employim*  f i . . . . . •  tiumbc;  < d  -ub-t-e;-. 

. . I  alternative  to  -in ps. 

4.0.  Comparison  with  undccoinposcd  alternatives 

I  alnes  Id  and  1  ti  provide  a  more  realistic  com  pari -on  |.e*  v.  e.*t,  *.■{■'  dial  ami  •> 

approaches  to  prerondit  innini!  t  ban  'able-  1  ami  J.  Rec.d!  that  feijntt  itiu  ev.,,  *  mo..; -.  •  .• 

-  U  '  >‘i'  'III  a  I II  '  jiena  I  i/es  beyond  reason  the  pi  'll  of  III  a  tic  e  oi  Me’  . .  den  Uii  p . '  •'  •• 

we  u-e  •■.vo  Mandat'd  precomiit  io;..'r-.  Mil. I  >  tl  j  ami  114  :  i  m  pin . I  ev.ict  •  > 

-  e i  -ii  a  .  .|  pn  'idem-  :t  1  7. 

Ad  ca.se>.  inv.ilvini*  an  unscah-d  Neumann  or  Robin  boundary  coiniit  nm  ' ;  . 

pi  o!  :  I :  s  .  tail  to  con  veree  I II  -Ann  step'  I]  -  I  ley  the  ejoti.d  Mil  I  a  pproac  :  .  am!  •>:  ■  . '  ■ 
a; *  confound'  tin-  1 1. 1  precomiit  ion  i  m: .  1  he  t  ile-ha.~ed  pren md  it  j.  .,i  inn  .  i  n  1:*m  :  ‘ 

a  .i  ti  v  .  ' ,  ri,  veree,]  m  a  re|at  ivei  v  mod'  -t  number  of  , ter.it .n-  ■■ -r  di  proidem  -  I  Mli  I 
4 ! !  :o.'i  i  I,  e  ]e;el  s  In  'lie  III  s!  e\e(  ij!  i'  'll  Mlm  -■  It)  ’  !;<  ..  j,  .|  y.  j, ;•  .  •  v.  'd-.  - .  .  •  .  ■ 

Ibrt'd.iet  probii  an-.  More  importantly  for  bit  tire  need ..  ;  he  Me-  b.i.-i-d  appo'.e  :. 

”'T  pr.i-pe.  t  -  for  parallel  execution,  and  since  it  i-  compel  jtjw  w  *•  •  i.e  .ip:e' 

o>.  era  :l  :  i .  i  r .  <  1 1  •  •  1  e)h,'ienc\.  relative  to  the  /.as/  >»  mil  itltinnUttn  •  >'  «.*.*.•  t/  n>u  e 
(hi-  ’.'.ill  lie  t  r  I !  ‘  ■  *  ■ '.  > ' !  I  1,1!  1.1  rue.  dt  st  II  iei  t  I’l  |  tie  Idol  \  tli.eli  I  te  -  'A  :  M’i  O  ia! !'.  i  ;•>  -a  •  ■'  :  '  •  M 

<  "tn  it .  u  n  e  a  t  |.,n  .  -im  e  t  he  amount  o|  .v.inniuiin  atm:i  re, ;  M  t  ,,  domain  ■ 
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'I'ablo  1-1:  Kxenition  lime  (sec)  as  a  function  of  number 
of  lib's  per  horizontal  and  vertical  sides  of  circumscribin'; 
square,  at  constant  refinement  parameter.  h~l  =  12S.  for 
a  reduction  in  the  initial  residual  of  10“’.  (iMRKS  was 
restarted  alter  every  100  iterations.  indicates  that  t  lie  it 
eration  had  not  converged  after  700  steps. 


Table  15:  Iteration  count  for  problems  7  for  ;i  global 
M 1 1.1(0 Fprecondit  ioi.ed  C  M  R  I  IS.  a  global  11.1(1 )- precon <ii- 
t  ioned  (IMRKS.  and  an  t  ile  exact /tangent  ial  preconditioned 
(IMRKS  (for  I  ---  s.  w  =  Hi),  at  refinement  p;ira meter  - 
12  s.  for  ;t  reduction  in  the  initial  residual  of  10“’. 
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'Fable  10:  F. :<«•<■  u t ion  time  (sec)  for  problems  #1  7  for  a 
global  M  1 1. 1 ’( 0 )- precond  it  ioned  ((MR  IS.  a  global  II. I  (  1  )- 
preeondit  ioned  ( l.\l  I!  I'.S.  ami  an  t  ile  exact  /tangent  ial  precon 
dttioned  ( 1 M  R  I  ,S  ( for  I  =  s.  m  —  10).  at  relinement  parame¬ 
ter  h~l  12s.  for  a  reduction  in  the  initial  residual  of  It)' 

Mi. all.  particularly  compared  to  global  techlliipies. 

•">.  Conclusions  and  future  directions 

I  peri  1 1 1  •  ■  1 1 1  on  a  diverse  group  of  problems  demonstrate  that  a  two  level  domain  decomposj 

1  ion  algorithm  with  a  -ingle  global  mar "-e  grid  ran  provide  "nearlv'  optimal  convergent  e  and  allov. 


a  great  deal  of  flex'  bilit  v  in  refinement  strategy,  while  also  permitting  a  data  structure  amenable 
to  parallel  and  vector  implementations,  as  summarized  in  closing  below.  Although  oft<'n  moti¬ 
vated  by  parallelization,  domain  decomposition  may  also  yield  runtime  and  memory  use  benefits 
as  a  sequential  programming  paradigm.  Furthermore,  the  simple  structure  of  individual  blocks  ol 
the  domain-decomposed  preconditioner  means  that  new  applications  are  found  for  the  “standard 
solvers  "  in  conventional  software  libraries. 

The  traditional  economies  of  local  uniform  mesh  refinement  can  be  straightforwardly  incorpo¬ 
rated  into  the  domain  decomposition  framework  at  the  price  of  interface  handlers  with  conditional.' 
for  refinement  differences  between  adjacent  subdomains.  Because  of  the  highly  modular  nature  of 
a  standardized  tile-oriented  domain  decomposition  code,  custom  discretizations  for  certain  classes 
of  singularities  may  be  archived  into  applications  libraries  for  reuse. 

The  tile  algorithm  demonstrated  herein  in  a  superscalar  mode  on  a  Multiflow  computer  is 
amenable  to  vectorization  in  either  of  two  ways.  The  regular  operation  sequences  on  the  tensor- 
product  subgrid  arrays  are  precisely  the  type  for  which  vectorizing  compilers  were  conceived.  1  lm 
vector  lengths  depend  on  the  precise  form  of  solvers  used  in  the  preconditioner,  but  would  fend  to 
be  rather  small  for  the  rows  of  individual  S  x  8  or  16  X  16  tiles  found  best  in  the  two-dimensional 
applications  above.  An  alternative  form  of  vectorization  can  be  realized  by  grouping  together  all 
tib's  of  a  given  (discrete)  size  and  shape  and  operating  in  lock  step  on  corresponding  elements  in 
each  file,  assuming  an  identical  solver  is  applied  to  each.  A  vector  in  this  approach  consists  ol  the 
i  n  element  from  each  of  the  subdomains.  Our  8  x  8  arrays  of  tiles  would  be  thus  be  optmial  lor 
machines  with  a  vector  length  of  6-1. 

Parallelization  requires  careful  attention  to  the  load  balancer/mapper  and  also  to  the  coarse 
grid  solve  in  the  preconditioner.  Some  complexity  estimates  pertaining  to  alternative  forms  ol 
the  latter  may  bo  found  in  [24].  The  main  disadvantage  of  the  two-level  algorithm  in  the  parallel 
context  is  that  the  choice  of  coarse  grid  granularity  is  even  more  of  an  “over-determined  problem 
than  in  serial.  Communication  cost  per  iteration  and  convergence  properties  potentially  inveigh 
against  the  lower  bounds  imposed  by  domain  geometry,  solution  and  coefficient  smoothness,  and 
parallel  load  balance.  The  key  determination  for  future  applications  of  the  tile  methodology  will 
he  whether  this  over-determination  is  consistent  in  practice.  Inasmuch  as  the  examples  herein  an- 
represe*  ative  of  single-independent  variable  problems,  and  parallel  communication  costs  generally 
comprise  a  relatively  smaller  proportion  of  the  total  work  in  coupled  multi-component  problems, 
there  <ire  substantia!  grounds  for  optimism  that  this  will  be  the  case. 
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